A population-based study exploring phenotypic clusters and clinical outcomes in stroke using an unsupervised machine learning approach

Individuals developing stroke have varying clinical characteristics, demographics, and biochemical profiles. This heterogeneity in phenotypic characteristics can affect cardiovascular disease (CVD) morbidity and mortality outcomes. This study uses a novel clustering approach to stratify individuals with incident stroke into phenotypic clusters and evaluates the differential burden of recurrent stroke and other cardiovascular outcomes. We used linked clinical data from primary care, hospitalisations, and death records in the UK. A data-driven clustering analysis (kamila algorithm) was used in 48,114 patients aged ≥ 18 years with incident stroke from 1-Jan-1998 to 31-Dec-2017 and no prior history of serious vascular events. Cox proportional hazards regression was used to estimate hazard ratios (HRs) for subsequent adverse outcomes for each of the generated clusters. Adverse outcomes included coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related and all-cause mortality. Four distinct phenotypes with varying underlying clinical characteristics were identified in patients with incident stroke. Compared with cluster 1 (n = 5,201, 10.8%), the risk of composite recurrent stroke and CVD-related mortality was higher in the other 3 clusters (cluster 2 [n = 18,655, 38.8%]: hazard ratio [HR], 1.07; 95% CI, 1.02–1.12; cluster 3 [n = 10,244, 21.3%]: HR, 1.20; 95% CI, 1.14–1.26; and cluster 4 [n = 14,014, 29.1%]: HR, 1.44; 95% CI, 1.37–1.50). Similar trends in risk were observed for the composite recurrent stroke and all-cause mortality outcome and for the subsequent recurrent stroke outcome. However, results were not consistent for subsequent risk of CHD, PVD, heart failure, CVD-related mortality, and all-cause mortality.
In this proof-of-principle study, we demonstrated how a heterogeneous population of patients with incident stroke can be stratified into four relatively homogeneous phenotypes with differential risk of recurrent and major cardiovascular outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes.


Introduction
Stroke is a leading cause of death and disability globally, with a substantial economic cost due to treatment and post-stroke care [1]. Patients at the time of incident stroke have varied clinical characteristics, demographics, and biochemical profiles. This heterogeneity in characteristics at the time of incident stroke affects cardiovascular morbidity and mortality outcomes [2]. Phenotyping (subgrouping) people after incident stroke, in terms of the risk of various cardiovascular outcomes, could allow better care to be provided to individuals with the poorest prognosis. Such care could include intensive secondary prevention strategies, including the use of novel medications such as proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors and colchicine, in patients at very high risk of adverse cardiovascular morbidity and mortality outcomes.
Cluster analysis, a hypothesis-free, data-driven unsupervised machine learning approach, has been widely used to analyse clinical data to identify new phenotypic subgroups of complex and heterogeneous diseases including obstructive sleep apnoea [3], asthma [4,5], chronic obstructive pulmonary disease, chronic heart failure [6], dilated cardiomyopathy [7], sepsis [8], Parkinson's disease [9], breast cancer [10], and diabetes [11]. This approach does not include outcome data and may be less biased in its results, especially when using retrospectively collected data. Clustering of clinical data may, therefore, be helpful in identifying subgroups of patients with incident stroke and generating new hypotheses. Efforts to determine such phenotypic groups in patients with incident stroke remain limited.
Using a large population-based cohort of adult patients with incident stroke, the objectives of this study are: (i) to identify patterns in linked primary and secondary clinical data and cluster patients based on phenotypic similarities; and (ii) to assess the association between phenotypic clusters and subsequent recurrent stroke or CVD-related mortality, recurrent stroke or all-cause mortality, coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality.

Study design and data source
This prospective population-based cohort study used the UK Clinical Practice Research Datalink (CPRD) GOLD database of anonymised longitudinal primary care electronic health records [12], linked to secondary care hospitalisation data (Hospital Episode Statistics [HES]) [13], national mortality data (Office for National Statistics [ONS]) [14], and social deprivation data (Index of Multiple Deprivation [IMD] 2015) [15]. Patients included in the CPRD GOLD database, drawn from a network of general practices across the UK, are representative of the UK general population in terms of sex, age, and ethnicity [12].

Study population
We identified a cohort of patients with incident non-fatal stroke in either primary care (CPRD GOLD) or secondary care (HES) between 1 January 1998 and 31 December 2017. Details about this cohort were previously reported [16]. Patients with a record of coronary heart disease (CHD), peripheral vascular disease (PVD), or heart failure before the incident stroke event were excluded. Patients were followed from the date of incident stroke diagnosis until they developed a major adverse cardiovascular event (MACE), died, ceased contributing data, or reached the last data collection date of the practice. The study flow diagram is shown in Fig 1.

Outcomes
The primary outcome was a composite of recurrent stroke or CVD-related mortality recorded after incident stroke in any of the linked data sources (CPRD, HES, or ONS registry). The secondary outcomes were CHD, recurrent stroke, PVD, heart failure, CVD-related mortality, all-cause mortality, and the composite of recurrent stroke or all-cause mortality.
Subsequent outcomes within 30 days were considered to represent or relate to the incident stroke event [16]. Analyses were, therefore, restricted to patients with subsequent outcomes occurring more than 30 days after incident stroke.
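The 30-day restriction can be illustrated as a simple filter. The following is a minimal sketch in Python, not the study's own code (the analyses were run in Stata and R). The record layout, the field names, and the choice to retain patients with no recorded outcome as censored are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

LANDMARK_DAYS = 30  # outcomes within this window are attributed to the incident stroke


@dataclass
class Patient:
    patient_id: str                # hypothetical identifier
    stroke_date: date              # date of incident stroke
    outcome_date: Optional[date]   # first subsequent outcome, None if none recorded


def eligible(p: Patient) -> bool:
    """Keep patients whose first outcome falls after the 30-day window.

    Assumption: patients with no recorded outcome are retained (censored).
    """
    if p.outcome_date is None:
        return True
    return (p.outcome_date - p.stroke_date).days > LANDMARK_DAYS


# Hypothetical records, for illustration only.
cohort = [
    Patient("A", date(2010, 1, 1), date(2010, 1, 20)),  # outcome inside the window
    Patient("B", date(2010, 1, 1), date(2010, 3, 1)),   # outcome after the window
    Patient("C", date(2010, 1, 1), None),               # no outcome recorded
]
analysis_cohort = [p for p in cohort if eligible(p)]
```

In this sketch an outcome on exactly day 30 is treated as falling inside the window; the source does not state how the boundary was handled.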

Potential candidate variables for phenotyping
Based on availability in the electronic health records and established association with CVD, 336 candidate variables were selected. These included demographic data, vital signs, biochemical parameters, comorbid conditions, and prescribed medications (S1 Table). For vital signs and biochemical test results, the most recent values recorded within 24 months before incident stroke were extracted. A medication was considered prescribed if a prescription was recorded within 12 months before incident stroke. All comorbid conditions were defined based on the latest record of the condition at any time before incident stroke. All code lists used have been published and are available for download [17,18].

Data processing
The variable distributions and missingness were first assessed. Multiple imputation by chained equations was used to account for missing data (S1 Fig, S2 Table). Ten imputed datasets were generated, using all available covariates and all the outcomes, although outcomes were not imputed [19,20]. The imputed datasets were pooled into a single dataset using Rubin's rules [21]. A high number of dimensions in a dataset with many variables/features is associated with a loss of meaningful differentiation between similar and dissimilar individuals (the 'curse of dimensionality') [22]. To improve the cluster analysis process and performance, feature selection was carried out to reduce collinearity, conditional dependence, and noise that increases variance. Feature selection was based on two widely used data-driven feature selection methods (Boruta [23] and Least Absolute Shrinkage and Selection Operator [Lasso] regression [24]; S2 Fig) and clinical expert consensus. An expert group of clinicians from both primary care (Consultant General Practitioners: NQ, JK) and secondary care (Stroke Medicine Consultants/Specialists: GN, GG) were independently consulted to reach consensus on which variables to select for the cluster analysis. Clinical expert consensus was defined as 75% (3 out of 4) agreement among the clinical experts on each variable. In total, 49 variables were rated important by the clinical experts and by at least 1 of the 2 data-driven methods (S1 Table). After evaluating correlation among the 49 selected variables using mixedCor and Lares functions in R for mixed-type data (S3
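The selection rule described above (expert consensus of at least 3 of 4, combined with support from at least 1 of the 2 data-driven methods) can be written out as a short sketch. This is an illustrative Python example only; the function name, the data layout, and the example votes are hypothetical, and the Boruta and Lasso outputs are assumed to be available as sets of retained variable names.

```python
EXPERT_THRESHOLD = 3  # 3 of 4 experts, i.e. 75% agreement


def select_variables(expert_votes, boruta_selected, lasso_selected):
    """Return variables with expert consensus AND support from >= 1 data-driven method.

    expert_votes: dict mapping variable name -> list of 4 booleans (one per expert).
    boruta_selected, lasso_selected: sets of variable names each method retained.
    """
    selected = []
    for name, votes in expert_votes.items():
        consensus = sum(votes) >= EXPERT_THRESHOLD
        data_driven = name in boruta_selected or name in lasso_selected
        if consensus and data_driven:
            selected.append(name)
    return selected


# Hypothetical votes and method outputs, for illustration only.
votes = {
    "age": [True, True, True, True],
    "sbp": [True, True, True, False],
    "hb": [True, False, False, True],
    "tg": [True, True, True, True],
}
boruta = {"age", "sbp", "hb"}
lasso = {"age"}
chosen = select_variables(votes, boruta, lasso)
```

Here "hb" is dropped because only two experts rated it important, and "tg" is dropped because neither data-driven method retained it, even though all four experts agreed.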

Phenotypic clustering
The prediction strength method of Tibshirani and Walther (2005) [25], as implemented in the kamila function, and the Elbow method were used to select the optimal number of clusters (S5 Fig). The kamila algorithm for mixed data clustering (S1 Text) was implemented to identify distinct patient phenotypic clusters. To ensure robustness of the clusters identified, 1,000 initialisations (that is, random starting points) were carried out. A plot of the clusters against the principal component analysis (PCA) dimensions was generated (S6 Fig).
Using the h2o package (http://www.h2o.ai), a gradient boosting model was applied to identify and rank the key covariates (candidate variables) that predict each of the identified phenotypic clusters. For each model, cluster membership was coded as 1 (belonging to the cluster) or 0 (belonging to any other cluster). SHAP (SHapley Additive exPlanations) was used to assess the discriminative influence of the variables for each of the identified clusters [26].
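The 1/0 coding of cluster membership amounts to building one binary target per cluster (a one-versus-rest setup). A minimal sketch in Python follows; the helper name and the example labels are hypothetical, and the gradient boosting and SHAP steps themselves are not reproduced here.

```python
def one_vs_rest(labels, target):
    """Code membership as 1 (belongs to `target` cluster) or 0 (any other cluster)."""
    return [1 if label == target else 0 for label in labels]


# Hypothetical cluster assignments for six patients.
cluster_labels = [1, 3, 2, 4, 2, 1]

# One binary outcome vector per cluster, each usable as the target of a separate model.
targets = {k: one_vs_rest(cluster_labels, k) for k in (1, 2, 3, 4)}
```

Each patient is coded 1 in exactly one of the four vectors, so each model contrasts one cluster against all of the others combined.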

Statistical analysis
For each cluster, descriptive characteristics were provided, reporting proportions (%) for categorical variables and mean (SD) or median (IQR) for continuous variables. Kruskal-Wallis and chi-squared tests were used to compare continuous and categorical data, respectively, across clusters.
The association between phenotypic clusters and adverse cardiovascular morbidity and mortality outcomes was assessed using Cox proportional hazards regression models. The hazard ratio (HR) for each phenotypic group is presented with its 95% confidence interval (CI) and corresponding p-value. Cumulative incidence plots were derived and differences between phenotypic groups were assessed by the log-rank test. All statistical analyses were performed using Stata SE version 17 (StataCorp LP) and R version 4.1.0. An alpha level of 0.05 was used.
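For readers less familiar with how a hazard ratio and its confidence interval relate to the underlying Cox model coefficient, the conversion can be sketched as follows. This is an illustrative Python example under the usual Wald (normal approximation) assumption; the function names are hypothetical, and the standard error is back-calculated from the interval reported in the abstract because the source does not report it.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def hazard_ratio_ci(beta, se):
    """Convert a Cox log-hazard coefficient and its standard error to HR and 95% CI."""
    hr = math.exp(beta)
    lower = math.exp(beta - Z_95 * se)
    upper = math.exp(beta + Z_95 * se)
    return hr, lower, upper


def se_from_ci(lower, upper):
    """Recover the approximate standard error of log(HR) from a reported 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


# Illustration with the cluster 4 versus cluster 1 estimate reported in the abstract
# (HR 1.44; 95% CI 1.37-1.50). The standard error is back-calculated, not reported.
se = se_from_ci(1.37, 1.50)
hr, lo, hi = hazard_ratio_ci(math.log(1.44), se)
```

Because the published bounds are rounded, the reconstructed interval reproduces their width on the log scale but need not match them to two decimal places.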

Ethics approval and consent to participate
Ethical approval for this study was obtained from the Independent Scientific Advisory Committee (ISAC), study protocol number 19_023R. De-identified (anonymised) patient data were obtained from the CPRD; hence, this study was exempt from obtaining informed consent from patients.

Clinical characteristics among phenotypic clusters
We identified 68,642 patients aged ≥18 years with any incident non-fatal stroke event between 1998 and 2017. A total of 20,528 patients with outcomes occurring within 30 days of the incident stroke event were excluded, as these outcomes were considered to be related to the incident stroke event [16]. Cluster analysis was performed in the remaining 48,114 patients. Four phenotypic clusters with significant differences in clinical characteristics were identified. The identified clusters were numbered from 1 to 4 in ascending order of overall incidence of the primary outcome, the subsequent composite of recurrent stroke or CVD-related mortality. Table 1 describes

Variable importance for clusters
The supervised gradient boosting model used to identify key covariates (candidate variables) that predict the respective phenotypic clusters had excellent prediction accuracy, with an area under the receiver operating characteristic curve (AUC) of 0.985, 0.982, 0.974, and 0.970 for clusters 1, 2, 3, and 4, respectively. The most common variables for predicting the respective phenotypic clusters were age at incident stroke, blood pressure, hypertension, LDL cholesterol, and potency of prescribed statin (Fig 2).
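As a reminder of what the AUC values measure, the statistic can be computed directly from predicted scores and true membership. The sketch below is an illustrative Python example using the pairwise (Mann-Whitney) form of the AUC; it is not the h2o implementation, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    Equals the probability that a randomly chosen member of the cluster (label 1)
    receives a higher score than a randomly chosen non-member (label 0);
    ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of belonging to one cluster.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
```

An AUC near 1, as reported for all four clusters, means that the model almost always scores true members of a cluster above non-members. The pairwise loop is quadratic in the number of patients and is intended only to show the definition, not for use on a cohort of this size.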
For risk of subsequent heart failure, CVD-related mortality, and all-cause mortality, cluster 2 had a significantly decreased risk compared with cluster 1, while clusters 3 and 4 had a significantly increased risk (Table 2). The occurrence of subsequent cardiovascular morbidity and mortality outcomes across the different phenotypic clusters is presented as Kaplan-Meier plots in Fig 4.

Discussion
This population-based study, which explored phenotypic characteristics of patients with incident stroke using a data-driven cluster analysis approach, identified four clinically meaningful patient clusters based on phenotypic characteristics at the time of incident stroke. There was a varied relationship between the identified phenotypic clusters and subsequent risk of adverse cardiovascular morbidity and mortality outcomes.
In our study, four distinct and clinically meaningful phenotypic clusters were identified. Smoking, a strong independent modifiable risk factor for cardiovascular morbidity and mortality outcomes [27], was most prevalent in clusters 1 and 2. A preventative strategy that communicates the risks of smoking and the benefits of quitting to these clusters of patients could be an effective means to promote smoking cessation and reduce the risk of subsequent adverse outcomes [28]. With the exception of cluster 2, the clusters had a high prevalence of multiple long-term conditions as well as CVD risk factors at the time of incident stroke. Patients with incident stroke have been shown to commonly have pre-existing long-term conditions [29]. To optimally manage the possible atherogenic effect of these comorbid conditions and reduce the risk of subsequent cardiovascular morbidity and mortality outcomes, both non-pharmacological (that is, lifestyle modification [30,31]) and pharmacological (antihypertensives for blood pressure management [32]; lipid-lowering medications such as statins for cholesterol management [33]; antidiabetics for blood sugar control [30]; and antiplatelets/anticoagulants to manage arrhythmia [34]) strategies need to be prioritised in line with clinical guidelines [35]. Frequent monitoring and review to ensure treatment targets are being met is important [36]. Age, a non-modifiable risk factor, was a key factor for patient cluster membership. Among older adults (typical of cluster 4), incidence of aortic disease, PVD and venous increase as age-related alterations in vascular structure and function are compounded by the longer exposure to CVD risk factors [37].
Clustering is a common approach used to analyse large datasets, to identify both the number of subgroups in the data and the attributes of each subgroup, as was done in this study. Data analysed in real healthcare applications (from electronic health records) are mostly characterised by a mix of continuous and categorical variables. Common approaches that have been applied to mixed data include converting the variables to a single data type, by either coding the categorical variables as numbers or dummy coding them, and then applying standard distance-based methods designed for continuous variables, such as k-means, to the transformed data [38,39]. Continuous variables have also been converted to categorical variables using interval-based bucketing [40,41]. Similarities that may have been observed in the original data may be lost when the data are transformed in such ways [40]. The kamila clustering algorithm has, however, been shown to handle high imbalance between continuous and categorical data better than other methods [40,42]. From a computational perspective, when compared with other algorithms, the kamila algorithm offers the best performance and is the most time-efficient when dealing with large datasets (in relation to both observations and variables) in the setting of heterogeneous data, as was the situation in our study [40,42].

Strengths and limitations
To our knowledge, this is the first time that a data-driven cluster analysis has been used to identify stroke phenotypes in a well-characterised, large population-based cohort of adults with any incident stroke. This allows us to cover a wide range of stroke phenotypes. Most importantly, we had a comprehensive linked database with a broad spectrum of clinical data, with many of these variables being explored in cluster analysis for the first time. There are, however, limitations of this study worth considering. First and foremost, the study was not meant to propose a new classification for stroke, because the clusters are likely to vary according to patient characteristics and available data. These results serve to underscore the need for novel multidimensional stroke classification approaches for improving patient care. Furthermore, they are intended to generate hypotheses for future studies that will integrate clinical and biological data in patients, with the goal of improving the care of patients with stroke. With immense advancement in machine learning, cluster analysis can be performed in a large number of ways [42,43]. However, the knowledge and experience of the relevant experts remain the best judge in the interpretation of findings from cluster analysis, hence the involvement of a diverse group of clinical specialists, clinical researchers, and data experts in our study. Missing data are common in clinical research using electronic health records collected as part of routine care. For example, laboratory tests are typically requested only when considered necessary for a patient's health condition. Similarly, information on BMI or smoking status may not be consistently recorded, leading to potential bias in patterns of data completeness. To address this issue, multiple imputation by chained equations, as outlined in the methods section, was used to handle missing data in our study, which is the preferred option under any missingness mechanism [19,20].

Implications
Cluster analysis is well suited to addressing the multidimensional complexity of disease conditions with considerable heterogeneity, such as stroke. Population-based cluster analysis could provide further understanding of disease patterns. Additionally, patients could be phenotyped and allocated to specific clusters that could be associated with different risks for various outcomes. Different treatment strategies or interventions could be targeted at specific phenotypic clusters, based on available evidence on risk and possible response. Future clinical trial design could also focus on high-risk clusters or on specific aspects within a cluster.

Conclusions
Using an unsupervised learning data-driven cluster analysis on a broad spectrum of baseline clinical data of patients with incident stroke, we identified four phenotypic and clinically meaningful clusters with respect to risk of subsequent major adverse outcomes. These findings highlight the significant heterogeneity that exists within patients with incident stroke with respect to subsequent adverse outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes. Further exploration in different patient cohorts and populations is needed.
and compares the clinical characteristics among the phenotypic clusters. The plots of the clusters are shown with the principal component analysis (PCA) dimensions in S6 Fig. The cluster profiles are summarised in Box 2.

Fig 2. Plot showing the clinical parameters which are the core of each phenotypic cluster. aki: acute kidney injury; dbp: diastolic blood pressure; dm_eye_comp: diabetic ophthalmic complications; sbp: systolic blood pressure; gfr: glomerular filtration rate; hb: haemoglobin; hdl: high-density lipoprotein cholesterol; ldl: low-density lipoprotein cholesterol; hba1c: glycated haemoglobin; nonRH_aortic: non-rheumatic aortic valve disorder; smi: severe mental illness; tg: triglyceride; tia: transient ischaemic attack. The SHAP summary plot combines feature/variable importance with feature effects. Each point on the summary plot is a Shapley value for an individual. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The colour represents the value from low to high. The features are ordered according to importance. https://doi.org/10.1371/journal.pdig.0000334.g002

Table 1. Characteristics of study population at time of incident stroke according to cluster membership (n = 48,114).
-1.37; cluster 4: HR, 1.54; 95% CI: 1.48-1.60) and recurrent stroke outcome

Box 2. Summary of cluster profiles

Median age of 68 years (IQR 60-76), with a high proportion of patients who smoke or have diagnosed alcohol problems. Predominantly higher prevalence of CHD-related comorbidities/risk factors at time of incident stroke: high BMI (overweight/obese), diabetes, dyslipidaemia, hypertension, and family history of CVD. Higher proportion of antidiabetic and antihypertensive prescriptions.

Median age of 67 years (IQR 56-76), with lower prevalence of comorbid conditions at time of incident stroke. Higher proportion of smokers and patients with alcohol problems. Lowest proportion of prescribed medications.

Median age of 79 years (IQR 73-85), with the highest prevalence of multiple long-term conditions at time of incident stroke: arrhythmia, cancer, chronic kidney disease, dementia, dyslipidaemia, hypertension, renal disease, and transient ischaemic attack.